
Welcome Document


This document provides an overview of useful information for newcomers to the EU Tax Observatory.

Author: EU Tax Observatory

Date: June 19, 2025

Coding and data

The first part presents the main principles for achieving reproducible results; the second part provides practical advice for facilitating collaboration on research projects and ensuring good monitoring.

To do: add a motivation for these good practices.

Reproducibility is key to ensuring the credibility of results. We detail the main steps to follow: transparency, comments in coding scripts, data accessibility, and software versioning.

Transparency

This principle is quite simple: every methodological step, every choice, and every line of code must be accessible to outsiders.

To expand: why does transparency matter? Possibly motivate through some scientific scandals.

Comments

Coding files can be hard to understand, especially for readers unfamiliar with the language in question. Moreover, data processing can be done with different packages, especially in R, which can make the code difficult for outsiders to read. For instance, consider the following example, first in R:

df_data[, u_p := val / sur]
df_data <- df_data[u_p > 120 & u_p < 20000]
df_data[, sur := NULL]
lm(u_p ~ y_sale + h_type + l_mut, data = df_data)

The same steps in Stata:

gen u_p = val / sur
keep if u_p > 120 & u_p < 20000
drop sur
regress u_p y_sale i.h_type l_mut

If you are not an R expert, understanding such a chunk of code might be difficult. Hence, assessing whether there is a bug or a questionable methodological choice is highly challenging. Now, let’s look at the commented version of this script, first in R:

# we create a new column `u_p` that represents the housing unit price:
# housing value (`val`) divided by surface area (`sur`)
df_data[, u_p := val / sur]

# we now filter observations based on the unit housing price:
# we keep observations above 120 and below 20,000
df_data <- df_data[u_p > 120 & u_p < 20000]

# we remove the surface column (`sur`)
df_data[, sur := NULL]

# finally, we regress the unitary housing price on multiple variables:
# - y_sale: year of sale
# - h_type: housing type
# - l_mut: last date of mutation
# coefficients are obtained through OLS

lm(u_p ~ y_sale + h_type + l_mut, data = df_data)

The same commented script in Stata:
* Create the new variable u_p
gen u_p = val / sur

* Keep observations where u_p is between 120 and 20000
keep if u_p > 120 & u_p < 20000

* Drop the variable sur
drop sur

* Run the linear regression
regress u_p y_sale i.h_type l_mut

Comments are extremely helpful for others to understand your code. As a result, they can more easily spot any coding mistakes or poor methodological choices.

Additionally, consider revisiting an old project six months or a year later. Your coding practices may have evolved, and you might have switched packages, making it more challenging to navigate through your code. This is particularly true for lengthy code files, which can be difficult to comprehend after being away from them for several months!

However, for straightforward code, comments are not mandatory, as they can clutter the code. Commenting is thus a balance between clearly explaining what is done and keeping the code as clean as possible. Performing code reviews (see Section 2.6) helps to calibrate the level of commenting.

Data accessibility

The data used for the empirical analysis are key information to provide to ensure consistent and reproducible results. Hence, we need to detail the following:

  • Data source: where can outsiders access the data?
  • Data version: for datasets that are updated over time, specify the version used in this project (especially for flow data)
  • Metadata: variable definitions, units, and any documentation needed to interpret the data
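One lightweight way to pin the exact data version is to record a checksum alongside the source and metadata. A minimal sketch in Python (the function name `describe_dataset` is our own):

```python
import hashlib

def describe_dataset(path: str) -> dict:
    """Return a small metadata record for a data file: its path and a
    SHA-256 checksum that pins the exact version used in the analysis."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # read in chunks so large data files do not fill memory
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return {"path": path, "sha256": digest.hexdigest()}
```

Storing this record in the README or logfile lets a replicator verify they hold exactly the file the analysis used.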

Software versioning

Finally, a point that is often omitted in reproducible coding is software and package versioning. Packages and software evolve, which can introduce bugs in your code: for instance, code written in 2019 might break when run with current package versions. Hence, we need to specify the version of the main software (R, Python, or Stata) as well as the versions of the packages used. For R, the following code reports software and package versions.

# it provides information about the R version being used
# it also lists all packages attached in your session
sessionInfo() 
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin24.4.0
Running under: macOS Sequoia 15.6

Matrix products: default
BLAS:   /opt/homebrew/Cellar/openblas/0.3.30/lib/libopenblasp-r0.3.30.dylib 
LAPACK: /opt/homebrew/Cellar/r/4.5.1/lib/R/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Paris
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] languageserver_0.3.16 mgcViz_0.2.1          qgam_2.0.0            pbapply_1.7-4         scales_1.4.0         
 [6] fixest_0.12.1         readr_2.1.5           RPostgres_1.4.8       DBI_1.2.3             data.table_1.17.8    
[11] mgcv_1.9-3            nlme_3.1-168          sf_1.0-21             lubridate_1.9.4       RColorBrewer_1.1-3   
[16] ggpubr_0.6.1          tikzDevice_0.12.6     stringr_1.5.1         xtable_1.8-4          ggplot2_3.5.2        

loaded via a namespace (and not attached):
  [1] Rdpack_2.6.4        gridExtra_2.3       sandwich_3.1-1      rlang_1.1.6         magrittr_2.0.3     
  [6] multcomp_1.4-28     dreamerr_1.5.0      compiler_4.5.1      matrixStats_1.5.0   e1071_1.7-16       
 [11] callr_3.7.6         vctrs_0.6.5         pkgconfig_2.0.3     fastmap_1.2.0       backports_1.5.0    
 [16] promises_1.3.3      rmarkdown_2.29      tzdb_0.5.0          ps_1.9.1            nloptr_2.2.1       
 [21] purrr_1.1.0         bit_4.6.0           xfun_0.52           jsonlite_2.0.0      stringmagic_1.2.0  
 [26] blob_1.2.4          later_1.4.2         broom_1.0.9         parallel_4.5.1      R6_2.6.1           
 [31] stringi_1.8.7       GGally_2.3.0        car_3.1-3           boot_1.3-31         numDeriv_2016.8-1.1
 [36] estimability_1.5.1  knitr_1.50          Rcpp_1.1.0          iterators_1.0.14    zoo_1.8-14         
 [41] filehash_2.4-6      httpuv_1.6.16       Matrix_1.7-3        splines_4.5.1       timechange_0.3.0   
 [46] tidyselect_1.2.1    yaml_2.3.10         viridis_0.6.5       dichromat_2.0-0.1   abind_1.4-8        
 [51] doParallel_1.0.17   codetools_0.2-20    processx_3.8.6      lattice_0.22-7      tibble_3.3.0       
 [56] plyr_1.8.9          shiny_1.11.1        withr_3.0.2         S7_0.2.0            evaluate_1.0.4     
 [61] coda_0.19-4.1       survival_3.8-3      ggstats_0.10.0      units_0.8-7         proxy_0.4-27       
 [66] xml2_1.3.8          pillar_1.11.0       carData_3.0-5       KernSmooth_2.23-26  foreach_1.5.2      
 [71] reformulas_0.4.1    generics_0.1.4      hms_1.1.3           minqa_1.2.8         gamm4_0.2-7        
 [76] class_7.3-23        glue_1.8.0          emmeans_1.11.2      tools_4.5.1         lme4_1.1-37        
 [81] ggsignif_0.6.4      mvtnorm_1.3-3       grid_4.5.1          tidyr_1.3.1         rbibutils_2.3      
 [86] Formula_1.2-5       cli_3.6.5           viridisLite_0.4.2   dplyr_1.1.4         gtable_0.3.6       
 [91] rstatix_0.7.2       digest_0.6.37       classInt_0.4-11     TH.data_1.1-3       htmlwidgets_1.6.4  
 [96] farver_2.1.2        htmltools_0.5.8.1   lifecycle_1.0.4     mime_0.13           bit64_4.6.0-1      
[101] MASS_7.3-65        
# it returns the current version of the package of interest
packageVersion("data.table")
[1] '1.17.8'

In Stata:
* Display Stata version information
version

* Display information about installed packages
ado dir

* To get more detailed information about a specific package, you can use:
ado describe <package_name>

Here, the current version of R is 4.5.1, and the version of the data.table package is 1.17.8. You need to share this information to ensure that the results are fully reproducible, even in 10 years. An alternative is to use a Dockerfile, which specifies all relevant software and packages needed to run your script and replicate your results.
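For Python projects, a comparable sketch uses only the standard library (the package names in the loop are examples to be replaced by your project's dependencies):

```python
import sys
from importlib import metadata

# record the exact interpreter version used for the analysis
print(sys.version)

# record the installed version of each key package
# (replace the example names with the packages your project uses)
for pkg in ["numpy", "pandas"]:
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```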

Practical advice

In this section, we provide some practical advice for monitoring and managing large projects with multiple data sources and collaborators. First, we present the use of a general logfile to ease navigation through the code. Second, we explain how to share code and keep track of changes in scripts over time, using tools such as Git and GitHub. Finally, we provide some general advice on organizing a working directory.

The logfile

To monitor projects, especially long-term projects, keeping a logfile is highly advised. This file records all modifications made, with information such as the date, the author, and a brief explanation. It helps to track the evolution of the project, its main changes, and each individual’s contributions to the coding part of the research project. Here is a snapshot of the logfile used for an ongoing project:


# To-do
- [ ] Make some descriptive statistics about tenure status
- [ ] Merging and comparison with Orbis database
- [ ] Need to assess the quality of changes within the BO register
- [ ] Launch a new collection for 2025
- [ ] Outcomes about rental eviction, housing maintenance and renters income to compute
- [ ] Access to the TVVI database

# Previous changes

## 25.05.06

- Code for the internal workshop is ok. RL 2db4c55 
- Figures are currently working. First push in a GitHub repo in the next weeks. RL 2db4c55
- Code review. RL 2db4c55

## 25.04.20

- New code to have matching rate per percentile. RL 688eea8
- We also account for the individual shares and highlight that the 25% rule is a main limit for transparency. RL 688eea8
- New map for Paris. RL 688eea8
Note

The logfile is usually written in a lightweight open format such as plain text or Markdown, so it is easily readable and writable regardless of the OS.

The logfile can also contain a to-do section. Research projects are not always linear, and writing down future steps in this file is useful. Indeed, when re-opening a project, the logfile provides a good overview of what has been achieved and what remains to be done.

Finally, sharing the logfile between contributors is key to coordinating your efforts. A glance at the logfile lets every team member see the next steps to implement, while also making it easy to review previous achievements. Recording authorship makes communication easier, especially if there is a misunderstanding about parts of the code.
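To keep entries consistent, a small helper can append a dated, signed block to the logfile. A sketch in Python (the function `log_entry` and its interface are our own, modeled on the snapshot above):

```python
from datetime import date

def log_entry(logfile, notes, author, commit=""):
    """Append a dated block of notes to a Markdown logfile, signing each
    line with the author's initials and, optionally, a commit hash."""
    stamp = date.today().strftime("%y.%m.%d")
    tag = f" {author} {commit}".rstrip()
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(f"\n## {stamp}\n\n")
        for note in notes:
            f.write(f"- {note}{tag}\n")

# example: log_entry("logfile.md", ["New map for Paris."], "RL", "688eea8")
```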

README file

The README is probably the first file an outsider will open when accessing your research project. Your README file should include important information such as:

  • Title and authorship
  • The main objective of the research project
  • Information about how to access the data
  • The main software used in the project
  • The license of your code (in our case mostly open, but it depends)
  • Any explanation that helps readers understand your repository
Note

The README is always written in an open file format, typically Markdown.

An example of a README file


# Name of the project

## Overview
Here we discuss the objective of the project, a snapshot of the main conclusion, and potential redirection to the paper.

## Features of the code
List big steps of your code being accessible. For instance:

- Data collection
- Data cleaning
- Filtering of the dataset
- Descriptive statistics
- Econometric analysis

## License
This project is licensed under the MIT License. See the LICENSE file for details.

## Contact
For any questions or feedback, please contact:

Your Name: 
GitHub: 

Sharing code and versioning

While comments in coding scripts, data accessibility, and transparency are the key elements of full reproducibility, a major remaining issue is how to collaborate while keeping the repository consistent across collaborators. Say two members work jointly on the code: keeping both copies of the script up to date is challenging. Moreover, say you want to reproduce a figure that your script produced six months ago: how can you do so smoothly?

The key is to have a shared repository for all team members. This way, every change is transparent and easily accessible. In addition, you need to keep every version of the coding scripts.

GitHub enables us to achieve both objectives. First, it provides a consistent framework for team members to collaborate on scripts, and it highlights every change between two versions of the code, which makes it easier to track other members’ changes. Second, it retains every version of the coding scripts (even those from the beginning of the project), without the need for suffixes like _v2, _v3, _v100, which become very painful in the long term!

So, how do you use GitHub? You don’t need to master the Terminal to manage a GitHub project. GitHub Desktop lets you manage such a project smoothly and provides an interface for easily monitoring changes over time.

Let’s go over the main actions in a GitHub project. For setting up a repository, see here. We now have a repository with two team members:

---
config:
  theme: neutral
  max-width: 600
---

flowchart TB;
    A[Member 1] --> B(Project);
    C[Member 2] --> B;
Figure 1: Coding Project

Say Member 1 works on the project and makes some significant changes on their own laptop. For now, these changes are not shared with Member 2: Member 1 has to commit and push them.

---
config:
  theme: neutral
  max-width: 600
---

flowchart TB;
    A[Member 1] -->|Commit| B;
    C[Member 2] --> B;
Figure 2: Coding Project

Now, consider that the changes made by Member 1 are committed and pushed. Member 2 wants to work on the project and check the tracked changes. They have to pull the project on their laptop to see the changes. After that, both repositories are synchronized.

---
config:
  theme: neutral
  max-width: 600
---

flowchart LR;
    A[Member 1] --> B(Project);
    B --> |Pull| C[Member 2];
Figure 3: Coding Project

In addition, Member 2 can review the changes made by Member 1 in more detail, without re-reading all the code. Many tools offer this possibility, and GitHub Desktop highlights changes clearly:

Tracking changes
Note

Here, lines highlighted in green are added by Member 1, whereas lines highlighted in red are removed by Member 1. The stable part of the coding file is uncolored (not applicable here).

In the end, every member can trace back every change in the coding files using the commit hash. This ensures that every version of the coding files is kept, so every member can regenerate any figure or table, whatever version it came from. To sum up, the project is composed of numerous revisions that can be identified by their hash:

---
config:
  theme: neutral
---

gitGraph
    commit
    commit
    commit
    commit
    commit
    commit
Figure 4: Git

Hence, we advise you to record this hash in your logfile whenever you commit a significant change to the project. It helps to monitor and follow the evolution of the code.

Finally, the GitHub environment is not accessible from the CASD for security reasons. Inside the CASD, the repository is shared between all team members, but there is no system to track each member’s changes to the code. It is, however, possible to use Git (the software behind GitHub) locally to version the code in the same way and benefit from change highlighting.
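A minimal sketch of that local workflow, assuming Git is installed (file names and messages are illustrative; we use a throwaway directory here):

```shell
# create a throwaway directory standing in for the shared project
proj=$(mktemp -d)
cd "$proj"

git init -q                                 # turn it into a Git repository (once)
git config user.name "Member 1"             # identity recorded in each commit
git config user.email "member1@example.org"

echo 'x <- 1' > analysis.R                  # some work on the code
git add analysis.R                          # stage the changed file
git commit -q -m "Add analysis script"      # record a snapshot with a message

git log --oneline                           # list commits with their short hashes
# git diff <hash> then highlights changes since a given commit
```

Running `git log --oneline` gives exactly the short hashes you can record in the logfile.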

Organizing directory and files

Beyond the scripts themselves, structuring the coding directory is also good practice. First, avoid a single script that does everything. Research projects can be large, involving data loading, data management, descriptive statistics, and econometric analysis; one standalone file would be too large for team members or outsiders to understand easily.

On the other hand, code spread across many unordered files is also hard to follow. Say you join an ongoing project with multiple files: how do you know which one to execute first? The order of the scripts is a key element of full reproducibility; if you run the econometric analysis before the filtering step, the results will be dramatically different. Hence, one file must orchestrate everything and call each subscript.

We can call this script main_script and structure the code as follows (example from an ongoing project), first in R:

################################################################################
# INTREALES Project Code Preamble
################################################################################

# -----------------------------------------------------------------------------
# Project Information
# -----------------------------------------------------------------------------

# Author:  Author 1, Author 2, Author 3
# Title:   Code of super cool project
# Date:    2025-04-08
# Version: 1.0

# -----------------------------------------------------------------------------
# Load Necessary Libraries
# -----------------------------------------------------------------------------

# load all relevant packages for the analysis
source("init/packages.R")
theme_update(text = element_text(family = "serif"))
# -----------------------------------------------------------------------------
# Additional Setup or Configuration
# -----------------------------------------------------------------------------

output_table <- "output_code/table/" ## location of table outputs
output_figure <- "output_code/figure/" ## location of figure outputs
choice_w <- 16 # width of the output graphics (in inch)
choice_h <- 9 # height of the output graphics (in inch)

# -----------------------------------------------------------------------------
# Main Code
# -----------------------------------------------------------------------------

################################################################################

# Loading data -----------------------------------------------------------------

source("code/data/01_loading_data.R")
source("code/data/02_filtering_data.R")

# Descriptive statistics about the topic of interest ---------------------------

source("code/descriptive_statistics/01_summary_stat_sample.R")
source("code/descriptive_statistics/02_stat_observation_interest.R")

# Running an econometric analysis ---------------------------------------------

source("code/econometric_analysis/01_diff_in_diff.R")
source("code/econometric_analysis/02_robustness_checks.R")
source("code/econometric_analysis/03_placebo.R")

The same preamble in Stata:
* ################################################################################
* INTREALES Project Code Preamble
* ################################################################################

* -----------------------------------------------------------------------------
* Project Information
* -----------------------------------------------------------------------------

* Author:  Author 1, Author 2, Author 3
* Title:   Code of super cool project
* Date:    2025-04-08
* Version: 1.0

* -----------------------------------------------------------------------------
* Load Necessary Libraries
* -----------------------------------------------------------------------------

* In Stata, we typically use `ssc install` or `net install` to install packages.
* For example, to install a package, you might use:
* ssc install package_name

* -----------------------------------------------------------------------------
* Additional Setup or Configuration
* -----------------------------------------------------------------------------

* Define output directories
global output_table "output_code/table/"   // location of table outputs
global output_figure "output_code/figure/" // location of figure outputs

* Define graphics dimensions
global choice_w = 16 // width of the output graphics (in inches)
global choice_h = 9  // height of the output graphics (in inches)

* -----------------------------------------------------------------------------
* Main Code
* -----------------------------------------------------------------------------

* ################################################################################

* Loading data -----------------------------------------------------------------

do "code/data/01_loading_data.do"
do "code/data/02_filtering_data.do"

* Descriptive statistics about the topic of interest ---------------------------

do "code/descriptive_statistics/01_summary_stat_sample.do"
do "code/descriptive_statistics/02_stat_observation_interest.do"

* Running an econometric analysis ---------------------------------------------

do "code/econometric_analysis/01_diff_in_diff.do"
do "code/econometric_analysis/02_robustness_checks.do"
do "code/econometric_analysis/03_placebo.do"

The main script is quite simple to read, whether or not you are an R expert. The objective is to lay out all the needed steps, with comments explaining each one. The script can be decomposed into different sections.

First, we introduce a preamble with the key information about the project: the people involved, the date, and the objective. Second, we load everything we need; this avoids loading a package multiple times, makes it easy to tell a new team member which packages are required, and lets you list all package versions to ensure full reproducibility. Third, we load the data. Then, we run the analysis.

Naming files well is important. When you open a coding directory, you want to understand its structure easily. That’s why we advise storing your subscripts in subdirectories with explicit names such as descriptive_statistics, data, or econometric_analysis. It helps to navigate the directory and understand the code and the underlying choices (which is, in the end, the main point). Within each subdirectory, we advise numbering the coding files to indicate the order in which they must be executed. In the same fashion, names of objects within a script should be as clear as possible so team members understand what is what (but remember that comments help too!).

Finally, we can sum up the structure of the project in a more general manner. For an R project:

  • code
    • README.md
    • main_script.R
    • logfile.md
    • descriptive_statistics
      • 01_stat_code.R
      • 02_stat_code.R
    • loading_data
      • 01_loading_data.R
      • 02_filter_data.R
    • econometric_analysis
      • 01_diff_in_diff.R
      • 02_robustness.R
  • output_code
    • figure
      • fig_stat_des.png
      • fig_stat_des_sample.png
    • table
      • tab_summary_stat.tex
      • tab_main_results.tex
For a Stata project:

  • code
    • README.md
    • main_script.do
    • logfile.md
    • descriptive_statistics
      • 01_stat_code.do
      • 02_stat_code.do
    • loading_data
      • 01_loading_data.do
      • 02_filter_data.do
    • econometric_analysis
      • 01_diff_in_diff.do
      • 02_robustness.do
  • output_code
    • figure
      • fig_stat_des.png
      • fig_stat_des_sample.png
    • table
      • tab_summary_stat.tex
      • tab_main_results.tex

Here, we have two main directories (code and output_code). The latter contains all figures and tables produced by the scripts, making outputs easy to locate.
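To start a new project with this layout, the skeleton can be created programmatically. A Python sketch (the directory names follow the structure above; adapt them to your project):

```python
import os

# subdirectories of the standard layout described above
SUBDIRS = [
    "code/loading_data",
    "code/descriptive_statistics",
    "code/econometric_analysis",
    "output_code/figure",
    "output_code/table",
]

def create_skeleton(root):
    """Create the standard project directory structure under `root`,
    with empty README and logfile placeholders."""
    for sub in SUBDIRS:
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    for name in ("code/README.md", "code/logfile.md"):
        open(os.path.join(root, name), "a", encoding="utf-8").close()
```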

Note

This is just general advice! Of course, it depends on the project, the goal to be achieved, and keeping some flexibility is always a good idea!

Storing data

GitHub is a platform for sharing and collaborating on code. But what about data? Generally speaking, we should avoid storing data on it, for two reasons. First, the data may be sensitive and must not be published in a code repository. Second, Git is not designed for large or binary files: every version of a dataset is kept forever, quickly bloating the repository.

To store data securely, you can use the PSE NextCloud solution. Every member of the EU Tax Observatory has access to PSE NextCloud, which offers a cloud solution for storing files, documents, or data. It is similar to common cloud services such as Dropbox, but the servers are located in Paris, within PSE. Hence, it provides a GDPR-compliant (RGPD) way to store and share data between members. You can improve the security of a shared folder by adding passwords or time restrictions.

Note

In R, there is a package to link your coding repository with your NextCloud data repo.

Beyond where data are stored, it is important to store them in an easily readable format such as csv or parquet, so that other members or replicators can open them easily. Proprietary formats such as Excel should be avoided.
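As a minimal illustration in Python, writing and re-reading an output table in csv needs only the standard library (the file name and columns are examples; for larger data, parquet via the pyarrow package is a common choice):

```python
import csv
import os
import tempfile

# a few example rows of analysis output
rows = [
    {"id": 1, "u_p": 3500.0},
    {"id": 2, "u_p": 4120.5},
]

path = os.path.join(tempfile.gettempdir(), "sample_output.csv")

# write in an open, easily readable format
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "u_p"])
    writer.writeheader()
    writer.writerows(rows)

# any collaborator can re-open the file with standard tools
with open(path, newline="", encoding="utf-8") as f:
    back = list(csv.DictReader(f))
```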

Code review

Finally, mistakes in coding files are common. Performing code reviews from time to time is good practice to ensure that everything runs as planned. For instance, you may have added an extra 0 to a filtering threshold, changing the composition of your sample; left in a filter that was only meant to simplify data processing at an early stage; or forgotten to comment parts of the code where comments would be useful. Code reviews help track and correct these potential mistakes. When writing a coding file, you don’t necessarily have the hindsight to identify all issues (think of writing a draft: inconsistencies creep in all along).

A code review simply means reading the entire code sequentially and checking that everything is fine. First, make sure the code runs without errors; if it does not, fix it. Second, track any typo-like issues, such as wrong filter bounds or wrong column assignments, that might affect the results. Third, you may find your code too complex; if you can simplify it through partial rewriting, do it. Even if comments help explain your process, the simpler your code, the more understandable it is.

Additional resources about coding

You can find here some resources:

Drafting technical documents

Working practices

Work in progress

Suggestion made by Ninon:

  • Add some general advice about how to communicate with the senior researcher